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1.  INTRODUCTION 


Suppose  that  we  are  given  a  real-valued  finite-horizon  performance  measure  Y  = 
f(X ’o,-Xi,  •  •  • ,  X„),  where  X  =  ( Xn  :  n  >  0)  is  a  Markov  chain  having  initial  distribution 
p  and  transition  matrix  P.  (We  assume  throughout  this  paper  that  the  state  space  5  is 
finite,  unless  otherwise  stated.)  The  standard  way  to  estimate  the  expected  performance 
a  =  EY  is  to  use  a  sample  mean  of  i.i.d.  replicates  of  the  r.v.  Y  generated  under  initial 
distribution  p  and  transition  matrix  P. 

However,  importance  sampling  offers  an  alternative.  Specifically,  let  v  and  K  be, 
respectively,  an  initial  distribution  and  transition  matrix  chosen  so  that  p(x)  >  0  implies 
v(x)  >  0  and  P(x,y)  >  0  implies  K(x,y )  >  0.  To  indicate  the  dependence  of  the  expecta¬ 
tion  operator  of  the  Markov  chain  X  upon  the  initial  distribution  and  transition  matrix, 
we  shall  write  Ep(- )  to  denote  the  expectation  operator  relative  to  initial  distribution  p 
and  transition  matrix  P,  whereas  Ek(-)  is  the  expectation  operator  in  which  the  initial 
distribution  and  transition  matrix  are  given  by  v  and  K,  respectively.  Then,  it  is  easily 
seen  that  a  =  EpY  can  be  written  as 


n— 1 


«=  JJ  P(Xi’Xi+l) 


(1.1) 


Xq 

=  £ 

=  EKYLn 


i= 0 


<V-  _  \(f1)xo)  TT  -P(xi,  Si+l) 

J(Xq,  .  .  .  ,Xn)  .  .  II  . 

K*°)  K(xi,xl+l ) 


n  — 1 


v(x  o)  n  k(x"x- h) 


i=0 


where 

r  _  l*(Xo)  rV  PjXi'Xi+i) 

”  «xo)  M  K(x„xt+1y 

The  r.v.  Ln  is  known  as  the  likelihood  ratio  (of  Ep(-)  relative  to  £*■(•)).  The  identity 
(1.1)  suggests  that  a  can  be  estimated  by  calculating  i.i.d.  replicates  of  the  r.v.  YLn 
generated  under  initial  distribution  v  and  transition  matrix  K.  This  technique  is  known 
as  the  method  of  (static)  importance  sampling  for  Markov  chains.  (See  Glynn  and  Iglehart 
(1989)  for  a  discussion  of  a  related  variant  of  importance  sampling  known  as  dynamic 
importance  sampling.) 

Our  goal  in  this  paper  is  to  show  that  this  importance  sampling  technique  is  always 
very  poorly  behaved  from  a  variance  standpoint  when  the  time  horizon  is  large.  In  fact, 
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we  will  show  that  for  a  wide  variety  of  performance  measures,  the  variance  essentially 
grows  exponentially  rapidly  in  the  length  n  of  the  time  horizon.  This  result  suggests 
that  (static)  importance  sampling  of  the  kind  described  above  is  typically  going  to  be 
(highly)  inefficient  when  the  time  horizon  is  large.  While  this  conclusion  is  suggested  by 
some  previous  analyses  (see,  for  example,  Glynn  (1987)),  our  current  treatment  permits 
a  rather  precise  quantitative  characterization  of  the  statistical  inefficiency  of  importance 
sampling  in  such  a  context. 

These  results  are  of  interest  in  at  least  three  different  problem  settings. 

Setting  1  (Variance  Reduction):  By  choosing  v  and  K  judiciously,  one  hopes  that  the 
estimator  obtained  via  importance  sampling  will  have  a  corresponding  variance  that  is 
significantly  lower  than  that  of  the  standard  estimator.  This  expectation  is  borne  out 
in  certain  applications  contexts.  (See,  for  example,  Goyal  et  al  (1990).)  However,  our 
results  suggest  that  one  needs  to  proceed  with  caution  in  applying  importance  sampling 
to  performance  measures  in  which  the  time  horizon  is  large. 

Setting  2  (Gradient  Estimation):  As  is  well  known  (see,  for  example,  Glynn  (1986a)  and 
Glynn  (1987)),  formula  (1.1)  underlies  the  likelihood  ratio  method  for  estimating  gradients 
of  mean  performance  measures  with  respect  to  vectors  of  continuous  decision  parameters. 
In  Section  6  of  this  paper,  we  will  discuss  the  implications  of  our  results  for  such  gradient 
estimation  problems.  Our  results  will  show  that  when  the  time  horizon  corresponding  to 
the  performance  measure  is  large,  there  is  a  best  possible  choice  for  K  and  we  will  identify 
the  optimal  transition  matrix. 

Setting  3  (Optimization):  The  likelihood  ratio  gradient  estimator  mentioned  above  can 
be  used  for  optimization  purposes.  Two  different  approaches  can  be  followed.  One  method 
involves  developing  a  stochastic  recursive  algorithm  (such  as  the  Robbins- Monro  stochas¬ 
tic  approximation  scheme)  which  is  driven  by  newly  generated  likelihood  ratio  gradient 
estimators  at  each  iteration  (see,  for  example  Glynn  (1986b)).  The  second  idea  is  to  use 
importance  sampling  to  generate  via  simulation,  at  a  single  point  in  the  decision  parameter 
space,  an  estimate  to  the  entire  response  surface.  Typically,  the  estimated  response  surface 
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will  be  (at  least)  twice  continuously  differentiable.  Thus,  one  cam  attempt  to  estimate  the 
optimizer  of  the  true  surface  by  applying  deterministic  optimization  algorithms  (such  as 
Newton’s  method)  to  the  estimated  response  surface  (see  Rubinstein  and  Shapiro  (1989) 
for  further  details).  It  turns  out  that  the  gradients  associated  with  the  estimated  response 
surface  are  precisely  the  likelihood  ratio  gradient  estimators  of  Setting  2.  In  Section  6, 
we  describe  this  connection  more  fully  and  discuss  the  implications  of  our  results  for  this 
approach. 

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  2,  we  describe  a  formula 
for  the  variance  of  an  arbitrary  estimator  obtained  via  importance  sampling.  This  formula 
is  fundamental  to  our  subsequent  analysis.  In  Sections  3  through  6,  we  apply  the  formula 
to:  cumulative  costs  (Section  3),  terminal  costs  (Section  4),  steady-state  costs  (Section  5), 
and  likelihood  ratio  gradient  estimators  (Section  6).  Finally,  in  Section  7,  we  describe  the 
implications  of  the  theory  developed  in  this  paper  for  general  discrete-event  simulations. 


2.  A  FORMULA  FOR  THE  VARIANCE 

Given  that  we  apply  importance  sampling  with  initial  distribution  v  and  transition 
matrix  K  to  the  estimation  of  a,  the  variance  will  be  given  by 

var*[rin]  =  EkY2L\  -  ( EKYLn)2 
=  EKY2L2n-a2. 


(2.1) 


Since  a  is  independent  of  K,  our  goal  is  to  simplify  the  expression  for  E^Y2L\.  In  pursuit 
of  this  objective,  we  note  that 


(2.2) 


EkY2LI 


f  (x0,...,x„) 


v(xo) 


n 


P2(xj,z,+  i) 
K(x„xi+l) 


Let  G  =  (G(z,y)  :  x,y  6  S)  be  the  matrix  in  which  G(x,y)  =  P2(x,  y)/ K(x,  y)  when 
P(x,y)  >  0  and  G(x,y)  =  0  when  P(x,y)  =  0.  Assume  that  the  stochastic  matrix  P  is 
irreducible.  Then,  G  is  necessarily  irreducible.  Since  G  is  clearly  non-negative,  we  may  ap¬ 
ply  the  Perron-Frobenius  theory  for  non-negative  matrices  to  the  study  of  G.  In  particular 
(see  Karlin  and  Taylor  (1975)),  it  can  be  asserted  that  G  possesses  a  positive  eigenvalue 
A  =  A(G)  (known  sis  the  Perron-Frobenius  eigenvalue)  such  that  A  is  the  eigenvalue  of 
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maximum  (complex)  modulus.  Furthermore,  there  exists  precisely  one  linearly  indepen¬ 
dent  column  eigenvector  associated  with  A.  This  eigenvector  h  can  be  chosen  to  be  strictly 
positive  in  all  components.  Since  h  is  the  eigenvector  corresponding  to  A,  it  follows  that 


5 ZG(x>y)h(y )  =  *MX) 


for  x  £  S.  Hence, 


h(y) 

Xh(x) 


=  1. 


Let  R  =  ( R(x,y )  :  x,y  €  5)  be  the  matrix  in  which  R(x,y)  =  G(x,y)h(y)/(Xh(x)).  Then, 
R  is  non- negative  with  row  sums  equal  to  1,  and  is  hence  stochastic.  Furthermore,  R  is 
irreducible  since  G  is.  Noting  that 


G(x,y )  =  Xh(x)R(x,y)/h(y), 


it  is  evident  that 

(2.3)  n  "ri 

•=o  i=0 

Let  rj  =  (r](x)  :  x  €  5)  be  the  stochastic  vector  defined  by 


?7(x)  =  7  V(*)MX ), 


where  7  =  Ylz  At2(I)/l/(x)-  'Vith  the  aid  of  (2.3),  we  may  now  express  (2.2)  as 


n— 1 


EKY2Ll  =  7An  ^2  /2(*o,.-.,Xnb(xo)  JJ  R(xi,xl+i)h(x0)/h(xn) 


TO  ,...yZn 


1=0 


=  7 


f2(X  0,...,Xn) 


h(X  0) 
KXh) J  ’ 


where  £/*(•)  is  the  expectation  operator  associated  with  initial  distribution  77  and  tran¬ 
sition  matrix  R.  We  summarize  our  discussion  with  the  following  variance  identity;  the 
specialization  of  this  formula  to  r.v.’s  Y  that  are  additive  functionals  is  implicit  in  much 
of  the  large  deviations  discussion  given  in  Bucklew  (1990). 
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THEOREM  1.  If  P  is  irreducible,  then 


var K[YLn]  =  i\nER[Y2h(X0)/h(Xn)}  -  a2, 
where  7,  A,  h,  and  R  are  defined  as  above. 

Typically,  the  magnitude  of  Y  is  polynomial  in  n  (see  Sections  3  through  6).  Thus, 
the  variance  of  YLn  is  determined  by  the  exponential  behavior  of  An  (if  A  /  1).  Our  next 
proposition  tells  us  that  A  is  typically  strictly  greater  than  1.  Hence,  if  Y  is  of  polynomial 
order  in  n,  it  is  evident  that  var x\YLn\  is  basically  increasing  geometrically  fast  at  rate  A. 

PROPOSITION  1.  The  quantities  7  and  A  are  always  greater  than  or  equal  to  1. 
Also,  7  >  1  if  p  ^  v.  Furthermore,  suppose  P  is  irreducible.  Then,  A  >  1  if  P  ^  K. 

For  the  proof  of  Proposition  1,  see  the  Appendix.  Thus,  we  may  conclude  that  in  any 
non-trivial  importance  sampling  context  (i.e.  P  ^  K),  the  sequence  of  multipliers  7An  is 
growing  geometrically  fast.  Hence,  in  order  that  var^[KLn]  be  well-behaved  as  a  function 
of  n,  it  is  evident  that  ER[Y2h(Xo)/h(Xn)]  must  be  small.  For  example,  if  Y  =  1(A) , 
where  A  is  a  “rare  event”  under  i?,  var/<-|YLn]  can  still  be  of  moderate  size  (see  Cottrell 
et  al.  (1983)).  However,  as  we  will  see  in  Sections  3  through  6,  importance  sampling  in 
most  other  problem  settings  leads  to  geometric  growth  in  the  variance. 

3.  APPLICATION  1:  CUMULATIVE  COSTS 

Let  /  be  a  real-valued  function  defined  on  the  state  space  5  of  the  Markov  chain 
X.  Suppose  that  the  performance  measure  is  the  cumulative  cost  corresponding  to  the 
function  /,  namely 

Kn  =  E  A**)- 

fc  =  0 

Assume  P  is  irreducible  and  aperiodic.  Then,  Theorem  1  applies  that 
(3-1)  var*[rLn]  =  ER(Y2h(X0)/h(Xn))  -  a2n 

where  a„  =  EpYn.  Since  IFnl  <  n  ■  ||/||,  where  ||/||  =  max{|/(x)|  :  x  €  5},  it  is  evident 
that  aj;  =  0(n2)  as  n  — *  00.  If  K  ^  P,  Proposition  1  states  that  A  >  1  so  that  it  is  evident 
that  the  growth  of  the  variance  is  governed  by  f\nER[Y2h(Xo  )/h(Xn)\. 
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Note  that  R  inherits  the  irreducibility  and  aperiodicity  of  P.  Then,  we  can  assert 
that  R  has  a  unique  stationary  distribution  7Tr(-).  Let  Pr(-)  =  £#[/(•)]  be  the  probability 
distribution  on  X  associated  with  initial  distribution  rj  and  transition  matrix  R.  We 
can  apply  Theorem  4  of  Niemi  and  Nummelin  (1982)  to  conclude  that  there  exist  finite 
constants  0  =  Ylx  nR(x)fix )  and  o2r  such  that  for  any  t  €  R  and  z,  y  6  S, 

P*{n*(y„/n  -P)<t,Xn  =  z\XQ  =  z) 

->P{N(0,*2R)<t}«R(z) 

as  n  —*  oo,  where  jV( 0,  <rR )  is  a  normal  r.v.  with  mean  zero  and  variance  a2R.  Since  (3.2) 
holds  for  all  z,  it  is  evident  that 

Pr{X o  =  z,  n»(K„/n  -/?)<*,  Xn  =  z} 

-*  »7(a:)P{W(0,a^)  <  <}  n rR(z) 

as  n  —*  oo.  Note  that  the  r.v.’s  Xq ,  n  a  (Y„/n  —  (3),  and  Xn  are  all  asymptotically  indepen 
dent  under  Pr(-).  The  continuous  mapping  principle,  together  with  a  converging- toget he ■ 
argument  (see,  for  example,  Billingsley  (1968)),  therefore  proves  that  under  the  distribu 
tion  Pr(-), 

(3-4)  ( Yn/n)2h(X0)/h(Xn )  =►  / 32h(X0)/h(Xoo ) 

where  Xqq  is  a  r.v.  with  mass  function  ttr  and  Xo  and  Xoo  are  independent.  Since 
\h(X0)/h(Xn)\  <  ma x{h(x)/h(y)  :  z,y  <=  5}  <  oo,  we  can  apply  the  bounded  convergence 
theorem  to  (3.4),  yielding 

(3-5)  n-2ER[Y2h(X0)/h(Xn )]  -  /32Kik2, 

where  rtj  =  r}(x)h(x)  and  k2  =  ttr(z)//i(z).  If  0  0,  (3.5)  provides  the  asymptotic 

estimate  that  we  need  (since  the  strict  positivity  of  h  implies  that  k\,k2  >  0).  If  13  =  0, 
we  note  that  (3.3)  proves  that  under  Pr, 

(Yn/ni)2h(X0)/h(Xn)  =>  <t2rN (0,  l)2h(Xo)/h(X00), 
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where  iV(0,  l),Xo,  and  Xoo  are  independent  r.v.’s.  Theorem  3,  p.  102,  of  Chung  (1967), 
together  with  the  boundedness  of  h(Xo)/h(Xn),  then  yields  the  conclusion  that  if  j3  =  0, 

n~lER[Y2h(X0)/h(Xn)}  ->  cr|j/c1«2 

as  n  — »  oo.  We  can  summarize  our  discussion  thus  far  with  the  following  theorem. 

THEOREM  2.  Suppose  P  is  irreducible  and  aperiodic.  Then, 
i)  if  (3  #  0  and  P  ^  K, 

varAr[ynI„]  ~  n2'y\n/32 kxk2 


as  n  — >  oo; 

ii)  if  P  —  0,  P  ^  K,  and  cr2 1  >  0, 


var^fKIn]  ~  n'y\n<T2RKlK2, 


as  n  — ►  oo. 


Typically,  we  would  expect  /3  ^  0  to  hold  in  most  practical  settings,  in  which  case  the 
variance  grows  as  n2  An. 

It  is  instructive  to  also  describe  the  asymptotic  distribution  of  the  r.v.  YnLn.  Let 
Pk(-)  =  jE'/d-fC')]  be  the  probability  distribution  on  X  associated  with  initial  distribution 
v  and  transition  matrix  K.  Then, 


log(in)  =  log 


v(Xo) 


U(*o) 


n— 1 


+  £i°g 


»=0 


P(X„X,+l) 

[K(Xi,Xi+l) 


Applying  the  law  of  large  numbers  to  the  finite  chain  (Xn,X„+i),  we  conclude  that 
-  log  Ln  -»  itk(x)K(z,  y)  log  ^  P^X' 


*.» 


I  *'(*■!/) 


Pk  a.s.  as  n  -*  oo.  If  P  £  K,  the  strict  concavity  of  the  log  function  implies  that 


£  <  log 


y%fc-(r)fc(x,y) 


Li.V 


P(x,y) 


=  log  1  =  0, 
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so  that  in  this  case, 

(3.6)  Ln  —>  exp(£)  <  1. 

Pk  &-s.  as  n  — ►  oo.  For  an  arbitrary  function  g ,  let  ||^||  =  sup{|<7(x)|  :  x  G  5}.  Since 
\Yn\  <  n  •  ||/||,  it  is  evident  that  |y„£n|n  <  n»  [|/||n£,£,  so  that  (3.6)  yields 

(3.7)  n^„|ynLnj±  <  1 
PK  a.s.  Fix  0  <  £  <  1.  If  lim|ynLn|  >  e,  then 

lim„|y„Ln|  »  >  lim  £"  =  1, 

contradicting  (3.7).  Thus,  (3.7)  implies  that  lim|y„L„|  =  0,  so  that 

(3.8)  YnLn  -+  0 
Pk  a.s.  as  n  — *  oo,  whenever  P  ^  K. 

This  gives  us  a  more  complete  description  of  the  asymptotic  behavior  of  the  r.  v.  Y„L„ 
While  (3.7)  and  (3.8)  state  that  YnLn  is  very  small  (with  high  probability)  when  n  is  large 
Theorem  2  asserts  that  when  YnLn  is  large  (with  small  probability),  it  must  be  extremely 
large  (in  order  that  the  variance  grow  geometrically  fast).  Thus,  for  large  n,YnLn  is  a 
r.v.  that  takes  on  extremely  large  values  with  very  small  probability  and  small  values  with 
very  high  probability. 

4.  APPLICATION  2:  TERMINAL  COSTS 

As  in  Section  3,  let  /  be  a  real-valued  function  defined  on  the  state  space  5  of  the 
Markov  chain  X.  In  this  section,  we  are  concerned  with  the  terminal  cost  corresponding 
to  the  function  /,  namely 

y„  =  nxn) 

Assume  P  is  irreducible  and  aperiodic.  We  may  then  apply  Theorem  1  to  obtain  the 
identity 

(4-1)  var*-[ynL„]  =  'r\nER[f2(Xn)h(X0)/h(Xn)}  -  a2n. 
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where  an  =  EpYn.  Since  R  is  aperiodic  (by  virtue  of  the  aperiodicity  of  P),  it  is  evident 
that  for  each  x  €  S, 


(4.2)  Pit{Xn  -  j/|X0  =  x}  -►  irR(y) 
as  n  — *  oo.  Relation  (4.2)  implies  that 

(4.3)  Pr{X0  =  x,Xn  =  y}  -*  r](x)irR{y) 

as  n  — ►  oo.  Thus,  X0  and  Xn  axe  asymptotically  independent  r.v.’s.  Applying  the  contin¬ 
uous  mapping  principle  to  (4.3),  we  conclude  that 

(4.4)  f(X„)h(X0 )/h(X„)  =>  f(Xx)h(X0)/h(Xx.) 

as  n  — >  oo  (in  PR  distribution),  where  Xo  and  X<x>  are  independent  r.v.’s  (with  Xoo  having 
distribution  nR).  The  bounded  convergence  theorem  then  implies  that 

ER[f2(Xn)h(X0)/h(Xn))  -  Kik3 

as  n  — ♦  oo,  where  /ci  =  v(x)h(x)  and  k3  =  vR(x)f2(x)/h(x).  This  discussion  has 

yielded  the  following  theorem.  (Note  that  an  is  bounded.) 

THEOREM  3.  Suppose  P  is  irreducible  and  aperiodic.  If  /  ^  0  and  K  ^  P,  then 

var K[f(Xn)Ln\  ~  y\nKlK3 


a an—*  oo. 

The  argument  employed  in  Section  3  to  study  the  “almost  sure”  behavior  of  YnLn  is 
equally  applicable  here.  In  particular,  if  K  ^  P,  we  may  conclude  that 

f(Xn)Ln  — >  0 

PR  a.s.  as  n  — ♦  oo.  As  in  the  caise  of  cumulative  costs,  it  therefore  follows  that  when  the 
time  horizon  n  is  large,  f(Xn)Ln  takes  on  extremely  large  values  with  small  probability 
and  very  small  values  with  high  probability. 
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5.  APPLICATION  3:  STEADY-STATE  COSTS 

In  this  section,  we  apply  the  results  of  Section  3  (on  cumulative  costs)  to  the  analysis 
of  importance  sampling  for  steady-state  costs.  Given  a  real-valued  function  /  defined  on 
the  state  space  of  X,  let 

n  =  £/(*») 

k—0 

be  the  cumulative  cost  corresponding  to  the  function  /.  Let  Pp(-)  =  Epl(-)  be  the 
probability  distribution  on  X  associated  with  initial  distribution  p  and  transition  matrix 
P.  Assume  that  P  is  irreducible  and  aperiodic.  Then,  the  law  of  large  numbers  applies  to 
Yn  and  we  may  assert  that 

(5.1)  Yn/n-*^2nP(x)f(x)  =  r 

X 

Pp  a.s.  as  n  — ►  oc,  where  irp(-)  is  the  (unique)  stationary  distribution  of  P.  The  constant 
r  may  therefore  be  interpreted  as  the  steady-state  cost  associated  with  the  performance 
measure  /.  We  note  that  the  bounded  convergence  theorem  applies  to  (5.1),  yielding 

Ep[Yn/n }  -  r 

as  n  — >  oo.  Since  EpY„  =  ExLnYn,  it  follows  that 

EK[YnLjn\  -  r 

as  n  — ♦  oo.  Hence,  the  r.v.  YnLn/n  (when  generated  under  Pk )  can  be  used  as  an  estimator 
for  the  steady-state  mean  r. 

In  particular,  suppose  that  T  represents  the  computer  budget  available  to  estimate 
the  steady-state  mean  r.  To  simplify  our  analysis,  we  assume  that  exactly  one  transition 
of  the  chain  X  is  generated  per  unit  time.  Given  the  budget  T,  we  can  generate  (under 
Pk)  m  =  m(T)  replicates  of  the  chain  X ,  each  of  length  n(T),  where  n(T)  =  LT/m(T)J. 
This  results  in  the  estimator 

,  m(T) 

r(T) 

1  =  1 
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where  Yin(T)Lin(T)  is  the  i’th  independent  replicate  of  the  r.v.  Yn(T)Ln(T)-  The  mean 
square  error  (MSE)  of  r(T)  is  given  by 

(5.2)  MSEK[r(T)]  =  var  *[r(T)]  +  bias*  [r(T)])2, 

where  bias#  [r(T)j  =  Ekt(T )  —  r.  We  note  that 

VaTK-[r(T)]  =  ^~-yBIK[Yn(,T)Ln(T)] 
bias*r[r(T)]  =  Ep\Yn(T)/n{T )]  -  r. 

In  our  subsequent  analysis  of  (5.2),  we  shall  assume  that  n(T)  — ►  oo  as  T  — ►  oo.  (In 
virtually  all  practical  applications,  this  is  necessary  in  order  that  the  MSE  converge  to 
zero  a s  T  —*  oo.) 

Now,  Theorem  2  states  that  if  P  ^  K  and  0  ^  0,  then 

(5.3)  vartfKT)]  ~  n(T)A"<T>/T. 

To  analyze  the  bias  term,  we  recall  that  we  are  assuming  that  P  is  aperiodic  and  irreducible. 
Then,  it  is  well  known  that  Epf(Xn)  — ►  r  geometrically  fast,  and  hence 

6  =  -  r) 

n=0 

converges  absolutely.  Hence,  there  exists  p  €  [0, 1)  such  that 

OO 

EP[YJn\  -r  =  hln-  -  r)/n 

>=n 

=  b/n  +  0(pn) 

as  n  — ►  oo.  Thus,  if  6  ^  0  (as  is  typical  in  most  applications),  it  is  evident  that 

(5.4)  (biMK(r(T)])!  ~  b2ln\T) 

as  T  —*  oo.  The  following  theorem  is  easily  verified  from  the  asymptotic  formulae  (5.3) 
and  (5.4). 

THEOREM  4.  Suppose  that  P  is  irreducible  and  aperiodic.  Assume  6^0,  0  0, 

P  ^  K,  and  that  n(T)  — »  +oo. 
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i)  If  n(T)  —  (logT  -  log  log  T)/ log  A  — >  +00  as  T  -*  oo,  then 


var  K-[r(T)]  — ►  +oo,  bias/c[r(T)]  -*  0 


as  T  — ►  oo. 

ii)  If  n(T)  —  (logT  —  3  log  log  T)/' log  A  — ►  +oo  as  T  — ►  oo,  then 

(log  T)2var/£-[r(T)]  -*  oo,  lim  (lqgT)2bias*[r(r)]  <  (logA)2&2 

as  T  — ♦  oo. 

iii)  If  n(T)  —  (log  T  —  3  log  log  T)/  log  A  — »  a  as  T  — *  oo,  then 

(logT)2var^[r(T)]  -  7/?2*i*2ea,  (log r)2bias^[r(T)]  -  (logA)2fe2 
as  T  —+  oo. 

iv)  If  n(T)  —  (logT  —  3  log  log  T)/ log  A  — ►  — oo  as  T  — ►  oo,  then 

(logT)2vari<-(r(T)]  — ►  0,  lim  (logT)2bias^[r(T)]  >  (logA)262 

as  T  — +  oo. 

v)  If  n(T)  ~  clogT,  where  0  <  c  <  1/logA,  then 

(logT)2varif[r(r)]  0,  (logT)2bias^[r(T)]  62/c2 


as  T  — ♦  oo. 

This  result  has  a  number  of  important  implications.  Firstly,  we  note  that  part  i)  states 
that  if  n(T)  —  (logT  —  log  logT)/ log  A  — ►  oo,  MSEi<r(T)  —*  oo,  so  that  if  the  replication 
run-length  n(T)  grows  too  quickly,  the  mean  square  error  need  not  even  converge  to  zero. 
This  is  particularly  problematic  in  light  of  the  fact  that  log  A  is  typically  unknown. 

Parts  ii)  through  iv)  suggest  that  n(T)  «  (logT  —  3  log  logT)/ log  A  +  a  defines  the 
critical  case  in  which  both  components  of  the  mean  square  error  (namely  variance  and  the 
squares  bias)  go  to  zero  at  the  same  rate,  namely  (logT)-2. 

Finally,  the  theorem  (taken  as  a  whole)  suggests  that  the  best  possible  convergence 
rate  for  the  root  mean  square  error  of  r(T)  is  (logT)-1  as  T  — ♦  oo.  Thus,  the  convergence 
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rate  of  the  replicated  steady-state  importance  sampling  estimator  is  exceptionally  slow 
(when  compared  to  the  rate  of  T“*  which  is  achieved  for  typical  steady-state  simulations 
when  implemented  without  importance  sampling). 

The  poor  convergence  rate  of  r(T)  basically  arises  because  of  the  exponential  growth 
in  the  variance  of  YnLn.  One  way  to  (partially)  avoid  this  is  to  use  the  regenerative 
structure  that  is  present  in  finite-state  Markov  chains.  This  allows  us  to  reduce  the  time 
horizon  to  that  of  a  regenerative  cycle.  Specifically,  select  a  regeneration  state  z  and  let 
r  =  inf{n  >  1  :  Xn  —  z}.  Thus,  if  we  set  fi  =  6Z  (i.e.  a  unit  point  mass  at  z),  we  recall 
that  the  steady-state  cost  r  can  be  represented  as 


(5.5) 


r  =  EpYr/EpT 
■=  EK[YrLr]/Ep[TLr\. 


As  suggested  in  Glynn  and  Iglehart  (1989),  one  can  use  importance  sampling  to  estimate  r 
via  the  ratio  formula  (5.5).  Let  (Ylr,  rx ,  Lir ),  ( Y2t ,  r2,  L2t ), ...  be  a  sequence  of  i.i.d.  repli¬ 
cates  of  (Yr,  r,  Lr)  (generated  under  Pk)-  Suppose  that  the  time  required  to  generate  the 
i’th  cycle  is  r,  and  let  N(T)  =  inf{n  >  0  :  5Zr=i  r«  —  r)  be  the  number  of  cycles  completed 
in  T  units  of  computer  time.  Let  r'(T)  be  the  ratio  estimator  available  after  T  units  of 
computer  time  have  been  expended,  namely 


r'(T)  =  { 


N(T) 


1=1 


N(T) 

J2TiLi 

i=i 


;  N(T)  >  l 


l  0  ;  N(T)  =  0. 

In  Glynn  and  Iglehart  (1989),  it  is  shown  that  if  ErZ2  <  oo,  where  Z  =  (Yr  —  rr)Lr,  then 


r*(r'(r)  — r)=>diV(0,l) 


as  T  — ►  oo,  where  d2  =  Ekt-ErZ2 /(Epr)2.  Thus,  the  regenerative  estimator  r'(T)  enjoys 
a  convergence  rate  of  T-1^2,  provided  that  ErZ2  <  oo.  Unfortunately,  as  argued  in  Glynn 
and  Iglehart  (1989),  the  quantity  E^Z2  can  be  infinite  (even  in  the  current  finite  state 
Markov  chain  setting). 
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We  can  use  the  machinery  developed  in  this  paper  to  obtain  a  precise  analysis  of  when 
ErZ*  <  oo  will  hold.  Let  /c(x)  =  /(x)  —  r  and  observe  that  Z  =  YJ.Lr,  where 

K  =  £  /«(■*»)• 

*=0 


Then,  for  each  n  >  1,  Theorem  1  applies,  yielding 
(5.6)  EK\Yln\l{T  =  „))  =  7A”£k 

Slimming  each  side  of  (5.6)  over  n  and  using  Fubini’s  theorem,  we  get  the  identity 


,2M*o) 

n  h{xnyK  [ 


(5.7) 


ErZ2  =  7  er 


I'r-l 


M  £/<(**) 


a= o 


(We’ve  used  the  fact  that  h(X 0)  =  h(Xr)  and  that  Pr{t  <  oo}  =  Pr{t  <  oo}  =  1  in 
obtaining  (5.7)).  In  virtually  all  practical  applications,  it  will  therefore  be  necessary  tha* 
ErXt  <  oo,  in  order  that  ErZ2  <  oo.  Furthermore,  if  ErXq  <  oo  for  some  A0  > 
this  will  always  be  sufficient  to  guarantee  the  finiteness  of  ErZ2.  (To  see  this,  note  that 
XrYr 2  <  ||/c||2r2Ar  <  ||/C||2AJ  on  {r  >  no},  where  n0  is  chosen  sufficiently  large.)  Thus, 
the  finiteness  of  ErZ2  basically  comes  down  to  the  issue  of  when  the  Perron-Frobenius 
eigenvalue  A  =  A (G)  lies  in  the  interior  of  the  set  {z  E  ]R  :  Erzt  <  oo}. 

Let  A  be  the  submatrix  of  R  defined  by  A  -  ( R(u ,  v)  :  u,  v  G  5  —  {x}).  (Note  that  .4 
has  one  fewer  row  and  column  than  does  R.)  Clearly,  A  is  non- negative.  Suppose  that  .4 
is  irreducible.  Then,  A  has  a  unique  positive  Perron-Frobenius  eigenvalue  A(.4). 


PROPOSITION  2.  Suppose  that  A  is  irreducible  and  finite.  Then,  ErXt  <  oo  if  and 
only  if  A(G)  <  A(^)"1. 

For  the  proof,  see  the  Appendix. 

Thus,  the  slow  convergence  rate  of  the  replicated  steady-state  estimator  r(T)  manifests 
itself  in  the  regenerative  setting  through  the  possibility  of  infinite  variance.  If  ErZ2  = 
4-oo,  one  can  not  expect  a  convergence  rate  of  but  must  instead  expect  a  slower 

rate  of  convergence.  As  we  have  just  argued,  in  order  that  ErZ 2  <  oo,  this  will  typically 
require  that  A((7)  <  A(A)_1.  Unfortunately,  since  A(G)  and  A(A)  are  unknown  in  practical 
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settings,  this  suggests  that  a  great  deal  of  care  must  be  exercised  in  applying  importance 
sampling  using  the  regenerative  estimator  r'(T). 


6.  APPLICATION  4:  LIKELIHOOD  RATIO  GRADIENT  ESTIMATOR 


Consider  a  family  {(P(0),/i(0))  :  9  €  A}  of  transition  matrices  and  initial  distributions 
on  5  that  are  indexed  by  some  open  set  A  C  j Rk.  As  mentioned  in  the  Introduction,  the 
calculation  of  the  gradient  of  the  expected  performance  of  the  Markov  chain  with  respect 
to  9  is  a  problem  that  has  recently  attracted  considerable  attention  within  the  simulation 
community. 

One  approach  to  this  problem  is  known  as  likelihood  ratio  gradient  estimation.  In 
order  to  simplify  the  notation,  we  specialize  (without  any  essential  loss  of  generality)  to 
the  scalar  case  in  which  k  =  1.  Given  a  performance  measure  Y  =  f(X o,X\,. . .  ,Xn),  its 
expected  value  under  initial  distribution  fi(9)  and  transition  matrix  P(0)  is  given  by 

n— 1 

a(0)=  ^2  f(xo,---,Xn)n(9,x0)YlP(9,Xi,Xi+i)- 

Xo,...,Zn  i=0 

Assume  that  P(6)  and  fi(9)  are  both  continuously  differentiable  on  A.  Then,  for  80  £ 
A,  Qf'(^o)  exists  and  is  given  by 


n— 1 


a'(0o)=  /(*o, ••.,*»)  n'{0o,xo)  IJ  P(0o,Xi,r<+i)+ 


(6.1) 


Xq  ( •  ■  *»Xf] 


1=0 


n— 1 


M«0,x0  )J2p,(eo  i  xi,  £|-fl 

i=0 


In  order  to  estimate  a'(0o)  via  simulation,  it  is  necessary  to  represent  a'(0o)  as  an  expec¬ 
tation.  Suppose  that  we  select  v  such  that  v(x)  >  0  whenever  ii'(80,x  )  7^  0  or  n(80,x)  >  0 
and  select  K  so  that  K( x,y)  ±  0  whenever  P'(0o,x,y)  ^  0  or  P(90,x,y)  >  0.  An  impor¬ 
tant  observation  here  is  that,  under  our  hypotheses,  selecting  u  =  n(90)  and  K  =  P{90) 
always  fits  this  prescription.  (The  key  point  is  that  if  P'(90,x,  y)  ^  0  when  P(90,  r,  y)  =  0, 
this  implies  that  P(-,x, y)  is  strictly  negative  in  some  neighborhood  of  0o-)  Let  Es(-)  be 
the  expectation  operator  associated  with  initial  distribution  n(9)  and  transition  matrix 
P(0).  Then,  (6.1)  can  be  re-written  as 


(6.2) 


“'(flo)  =  Ee0[YL'n{9Q)\ 
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where 


,  /(Qo.Xo)  y!  f(«,,  *<■*■») 

nKo>  **(*«, *>)  £j  P(«.,x,,Xiti)' 


The  r.v.  L'n(9 o)  is  known  as  the  score  function.  We  can  now  apply  importance  sampling 
(see  formula  (1.1))  to  (6.2),  thereby  yielding 


a'(00)  =  EK[YL'n(6o)Ln], 


where 


p(e0,Xj,xi+1) 

"  *<*•)  fj  KiXi'Xi+x)  • 

The  estimator  based  on  Y  L'n(6o)Ln  (when  generated  under  Pr)  is  called  the  likelihood 
ratio  gradient  estimator. 

Suppose  P(0O)  is  irreducible  and  finite.  By  Theorem  1,  we  arrive  at  the  identity 

(6.4)  var* [yX'n(0o)I„]  =  1 r\nER[Y2L'n(60)2h(X0)/h(Xn )]  -  a'(90)2 

Since  A  >  1  if  P(90)  ^  K,  (6-4)  strongly  suggests  that  the  choice  K  =  P(9Q)  minimizes 
the  variance  of  the  likelihood  ratio  gradient  estimator  when  the  time  horizon  is  large. 

To  obtain  a  more  precise  statement,  we  need  to  specify  the  performance  measure. 
Specifically,  let  us  consider  a  cumulative  cost  of  the  form 

n  =  £/(*,), 

k=0 

where  /  is  some  real-valued  function  defined  on  the  state  space  5  of  X.  To  analyze  the 
right-hand  side  of  (6.4),  we  observe  that  the  finiteness  and  irreducibility  of  R  guarantees 
that  the  following  strong  laws  will  hold: 


iy*  -  J>R(x)/(x)  ^ 

n  ' 


Pr  a.s.  as  n  — ♦  oo.  As  in  Section  3,  one  can  show  that  Xo,  Xn,  and  (Vn,  I'n(0o ))  are  asymp¬ 
totically  independent  of  one  smother.  If  P(0O)  is  additionally  assumed  to  be  aperiodic,  then 
(6.5)  and  the  continuous  mapping  principle  implies  that 


n-WMFHXcMX.)  =>  ffVHX,  )/*(*»), 
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as  n  — ►  oo,  where  Xoo  is  a  r.v.  having  mass  function  The  bounded  convergence 

theorem  then  yields 


as  n  — ♦  oo.  We  can  summarize  our  discussion  thus  far  with  the  following  theorem. 

THEOREM  5.  Suppose  that  P(9q)  is  irreducible  and  aperiodic.  If  K  ^  P(9 0), 
0  ^  0,  and  ip  ^  0,  then 


varK[YnL'n(0Q)Ln]  ~  ~i\nn*02xp2 kxk2 


as  n  — *  oo. 

In  L’Ecuyer  and  Glynn  (1991),  it  is  shown  that  vartfo[ynZ,'n(^o)]  is  typically  of  order 
n3.  Hence,  Theorem  5  shows  that  choosing  K  ^  P(9 0)  significantly  degrades  the  variance 
(for  cumulative  cost  performance  measures)  when  the  time  horizon  is  large. 

We  conclude  this  section  with  a  brief  discussion  of  the  implications  for  optimization. 
Assume  that  P(  )  is  such  that  if  P(8,z,y)  >  0  for  some  9 ,  then  P(8,x,y)  >  0  for  all  6. 
Similarly,  assume  that  if  fi(8,  x)  >  0  for  some  0,  then  /j(0,  x)  >  0  for  all  9.  Suppose  that, 
as  suggested  in  the  Introduction,  one  simulates  the  Markov  chain  X  under  the  distribution 
P#0  associated  with  parameter  point  90  6  A.  Then,  we  can  obtain  a  globed  estimate  for 
a'(-)  by  using  (6.3): 

(6-6)  a'(0)  =  E9o[YL'n(0)Ln(0,9o)} 

where 

=  fi'(d,X o)  ^  P’(9,Xi,Xi+1) 

n[  }  fi(9,X o)  P(aj1+1) 


MMo) 


MXp)  P(0,Xi,Xn.l) 
b(8 o,X0)  A-J  P(90,X„X,+iy 


Note  that  by  simulating  at  the  single  parameter  point  90,  we  can  obtain  an  unbiased 
estimate  for  a'(8)  at  each  9  €  A.  However,  as  shown  earlier,  we  can  expect  the  variance 
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of  the  estimator  for  a'(8 )  to  increase  geometrically  (in  the  length  of  the  time  horizon)  at 
rate  A(0),  where  A(0)  is  the  Perron-Frobenius  eigenvalue  of  the  matrix  G(9)  =  ( G(9 ,  x,  y )  : 
x,  y  €  5)  and  G(9,x,y)  =  P2(0,  x,  y)/P(0o,  x,  y).  Recall  that 


£G(«,z,y)-l-£ 


P(9,x,y) 

P(0o,x,y) 


T  2 


—  1 


P(00,x,y). 


Thus,  we  can  expect  that  as  the  distance  between  6  and  8q  grows,  the  row  sums  of  G{8) 
grow.  Let  6(9)  =  min  j  G(9 ,  x,  y)  :  x  €  S  j  and  note  that 


G(9)e  >  6(8)e 

where  e  is  a  column  vector  of  l’s.  Perron-Frobenius  theory  implies  that  if  P(80)  is  irre¬ 
ducible,  there  exists  a  positive  row  vector  x(6)  such  that  x(9)G(0)  =  X(6)x(9).  Hence, 

A (9)x(9)e  =  x(8)G(8)e  >  6(9)x(9)e. 

Since  x(9)e  >  0,  it  follows  that  A(0)  >  6(9).  Thus,  the  growth  of  the  row  sums  forces  A(0) 
to  grow.  So,  we  can  expect  that  in  many  practical  settings,  A(0)  will  be  large  at  points 
9  that  are  distant  from  8q.  Thus,  the  geometric  growth  problem  discussed  above  may  be 
particularly  troublesome  at  points  8  that  are  distant  from  80.  This  suggests  that  great 
care  needs  to  be  taken  with  the  development  of  optimization  algorithms  based  on  (6.6). 

7.  IMPLICATIONS  FOR  DISCRETE-EVENT  SIMULATION 

Thus  fax,  our  discussion  in  this  paper  has  focused  on  the  analysis  of  importance 
sampling,  as  it  applies  to  discrete-time  finite  state  Markov  chains.  However,  we  believe 
that  the  results  presented  here  have  obvious  analogues  in  the  more  general  discrete-event 
simulation  context. 

The  basic  idea  is  that  a  typical  discrete-event  simulation,  when  considered  on  the  time 
scale  of  state  transitions,  can  be  viewed  as  a  discrete  time  Markov  chain  living  on  a  general 
(uncountable)  state  space.  In  particular,  suppose  that  we  let  5„  represent  the  “physical 
state”  (e.g.,  the  location  of  the  customers  in  a  queuing  network,  as  described  by  a  queue- 
length  vector)  of  the  system  at  the  time  of  the  n’th  (physical)  state  transition.  Also,  let 
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Cn  be  the  “clock  reading”  vector,  at  the  time  of  the  n’th  transition.  The  ith  component  of 
Cn  then  describes  the  time  that  remains  until  the  i’th  possible  trigger  event  will  initiate  a 
state  transition.  The  key  observation  is  that  Xn  =  (5„,  Cn)  is  then  a  discrete-time  Markov 
chain;  see  Glynn  (1983)  and  Glynn  (1989)  for  further  details.  Of  course,  the  state  space 
of  this  Markov  chain  is  typically  uncountable. 

The  analysis  presented  in  this  paper  hinges  on  two  important  pieces  of  mathemati¬ 
cal  machinery.  The  first  is  the  existence  of  the  Perron-Frobenius  theory  of  non-negative 
matrices.  Fortunately,  much  of  this  theory  carries  over  to  the  more  general  setting  of  non¬ 
negative  operators  acting  on  an  abstract  state  space;  see  Chapter  5  of  Nummelin  (1984)  for 
a  recent  account  of  this  theory.  The  second  tool  that  was  repeatedly  applied  was  the  limit 
theory  for  additive  functionals  of  finite  state  Markov  chain  (e.g.,  laws  of  large  numbers 
and  central  limit  theorems).  Again,  these  results  have  a  number  of  generalizations  to  the 
uncountable  state  space  setting;  see  Niemi  and  Nummelin  (1982)  for  a  description  of  such 
results. 

Our  view  is  therefore  that,  under  suitable  regularity  hypotheses,  the  results  of  this 
paper  will  carry  over  to  the  general  state  space  setting,  and  hence  to  discrete-event  simula¬ 
tions.  For  example,  we  would  expect  (as  in  Theorem  2)  that  when  importance  sampling  is 
applied  to  a  cost  that  is  cumulated  over  the  first  n  transitions  of  the  system,  the  variance 
will  typically  grow  at  rate  n2 An  for  some  constant  A  >  1.  In  other  words,  the  results 
obtained  in  this  paper  are  qualitatively  representative  of  what  one  should  expect  in  the 
more  general  discrete-event  simulation  setting. 
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APPENDIX 


Proof  of  Proposition  1.  Consider  a  typical  row  of  the  matrix  G.  Then,  for  each 
x  6  5 


^2  G(x’  y)  - 1  =  Y1  p2(*»  y)/K(x .  y)  - 1 


(A.l) 


=  £ 


•P(*,y) 

K(x,y) 


-  1 


•K"(a5,y)  >  0. 


Let  B  =  (J3(x,  y)  :  x,  y  €  S)  be  the  stochastic  matrix  defined  by 


J9(x,  y)  =  G(x,  y)/  ^  G(x,  2), 

2 

and  note  that  G(x,y)  >  B(x,y )  for  each  x,y  £  5  (since  the  normalization  factor  that 
defines  each  row  of  B  is  greater  than  or  equal  to  1,  by  (A.l)).  By  Corollary  2.3,  p.  551, 
of  Karlin  and  Taylor  (1975),  A  =  A(G)  >  A (B),  where  A (B)  is  the  Perron- Frobeniu? 
eigenvalue  of  B.  But  since  B  is  stochastic,  A (B)  =  1.  Hence,  A  >  1. 

As  for  7,  an  argument  similar  to  (A.l)  shows  that  7  >  1,  with  7  =  1  if  and  only  if 
1/  =  /i. 

It  remains  to  show  that  A(G)  >  1  if  P  K.  In  this  case,  at  least  one  row  sum  of  G 
is  strictly  greater  than  1.  Hence,  there  exists  at  least  one  state  x,  such  that  G(x,,y)  > 
B(x.,y)  for  each  state  y  such  that  B(x.,y)  >  0. 

Note  that  B  is  irreducible,  since  P  (and  hence  G)  is.  Since  B  is  stochastic,  there 
exists  a  unique  strictly  positive  stochastic  row  vector  (the  stationary  distribution  of  B) 
such  that  <fB  =  ip.  Furthermore,  there  exists  a  strictly  positive  column  eigenvector  h 
corresponding  to  the  Perron-Frobenius  eigenvalue  A  =  A(G)  of  G.  Since  Gh  =  A  ft,  it  is 
evident  that 


(A.  2)  yGh  =  Xtph. 

On  the  other  hand,  Gh  >  Bh  with  strict  inequality  in  the  x,’the  component.  (This  follows 
from  the  strict  positivity  of  h.)  Hence,  p>Gh  >  tpBh.  (Here,  we  make  use  of  the  strict 
positivity  of  <p.)  So,  using  the  stationarity  of  p>  and  (A. 2),  we  get 

Ay>/i  >  (pBh  =  tph. 
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We  conclude  that  A  =  A (G)  >  1  (since  the  positivity  of  <p  and  h  implies  that  iph  >  0). 


Proof  of  Proposition  2.  The  argument  is  very  similar  to  that  used  in  Section  2  to  study 
var /c[yL„].  We  note  that 


Er y  =  A  +  A  £  R(x,y)ER[\r\X0  =  y]. 


Recall  that  A  necessarily  possesses  a  strictly  positive  column  eigenvector  h'  such  that 
Ah'  —  \(A)h'.  Set  C  =  (C(u,v)  :  u,v  G  5  —  {x}),  where 


C(u,v)  = 


A(u ,  v)h'(v) 


A (A)h'(u)  • 

Arguing  as  in  Section  2,  it  is  easy  to  see  that  for  y  ^  x,  n  >  1, 

h'(Xo) 


PR{r  =  n\X0  =  y}=X(A)n-1Ec 


R(Xn-i,x) \Xa  =  y 


h'(Xn- ,) 

where  i?c(-)  is  the  expectation  operator  in  which  X  evolves  according  to  the  (stochastic) 
matrix  C.  We  therefore  arrive,  for  y  ^  i,  at  the  identity 

^«[Ar|X0=y]  =  A(A)R(y,x) 


(A.  3) 


+  A(A)^[A(G)A(A)]"£c 


n=0 


h'(X q) 

k'(Jfn) 


R(X„,x)|X0  =  y 


Note  that  the  expectations  Ec(-)  appearing  in  (A. 3)  are  bounded  above  by 
max{/i'(u)R(i;,i)//i'(u)  :  u,v  G  5-  {x}}  so  that  if  A(G)A(A)  <  1,  (A.3)  clearly  con¬ 
verges.  On  the  other  hand,  there  exists  a  sequence  of  the  form  I  =  {i  +  jm  :  m  >  0} 
(depending  on  the  periodicity  of  X  under  C)  such  that  R( Xn,x)  is  bounded  away  from 
zero  for  n  €  I.  Since  h'  is  strictly  positive,  this  guarantees  that  the  expectations  Ec(-)  are 
bounded  away  from  zero  on  the  subsequence  I.  Thus,  in  order  that  (A.3)  converge,  it  is 
necessary  that  A(G)A(A)  <  1. 
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